System and method for annotating multi-modal characteristics in multimedia documents

ABSTRACT

A system for manually annotating multi-modal characteristics in multimedia files. There is provided an arrangement for selecting an observation modality (video with audio, video without audio, audio with video, or audio without video) to be used to annotate multimedia content. While annotating video or audio features in isolation results in less confidence in the identification of the features, observing both audio and video simultaneously and annotating that observation results in a higher confidence level.

FIELD OF THE INVENTION

[0001] The present invention relates to the computer processing of multimedia files. More specifically, the present invention relates to the manual annotation of multi-modal events, objects, scenes, and audio occurring in multimedia files.

BACKGROUND OF THE INVENTION

[0002] Multimedia content is becoming more common both on the World Wide Web and on local computers. As the corpus of multimedia content increases, the indexing of features within the content becomes more and more important. Observing both audio and video simultaneously and annotating that observation results in a higher confidence level.

[0003] Existing multimedia tools provide capabilities to annotate either audio or video separately, but not as a whole. (An example of a video-only annotation tool is the IBM MPEG7 Annotation Tool, inventors J. Smith et al., available through [http://]www.alphaworks.ibm.com/tech/videoannex. Other conventional arrangements are described in: Park et al., “iMEDIA-CAT: Intelligent Media Content Annotation Tool”, Proc. International Conference on Inductive Modeling (ICIM) 2001, South Korea, November 2001; and Minka et al., “Interactive Learning using a Society of Models,” Pattern Recognition, Vol. 30, pp. 565, 1997, TR #349.)

[0004] It has long been recognized that annotating video or audio features in isolation results in less confidence in the identification of the features.

[0005] In view of the foregoing, a need has been recognized in connection with providing improved systems and methods for observing and annotating multi-modal events, objects, scenes, and audio occurring in multimedia files.

SUMMARY OF THE INVENTION

[0006] In accordance with at least one presently preferred embodiment of the present invention, there are broadly contemplated multimedia annotation systems and methods that permit users to observe solely video, video with audio, solely audio, or audio with video and to annotate what has been observed.

[0007] In one embodiment, there is provided a computer system which has one or more multimedia files that are stored in a working memory. The multi-modal annotation process displays a user-selected multimedia file, permits the selection of a mode or modes in which to observe the file content, records annotations of the observations, and saves the annotations in a working memory (such as an MPEG-7 XML file).
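By way of illustration only, the four observation modes and the saving of annotations to an XML file might be sketched as follows. This is a minimal sketch under stated assumptions: the names ObservationMode, Annotation, and annotations_to_xml, and the element and attribute names, are hypothetical and are not taken from the MPEG-7 schema; a conforming MPEG-7 description would use the element structure defined by that standard.

```python
from dataclasses import dataclass, field
from enum import Enum
import xml.etree.ElementTree as ET


class ObservationMode(Enum):
    """The four observation modalities named in the abstract."""
    VIDEO_WITHOUT_AUDIO = "video-without-audio"
    VIDEO_WITH_AUDIO = "video-with-audio"
    AUDIO_WITH_VIDEO = "audio-with-video"
    AUDIO_WITHOUT_VIDEO = "audio-without-video"


@dataclass
class Annotation:
    """One annotated observation (hypothetical record layout)."""
    mode: ObservationMode
    start_sec: float
    end_sec: float
    labels: list = field(default_factory=list)
    keywords: str = ""


def annotations_to_xml(media_file: str, annotations: list) -> str:
    """Serialize annotations to a simplified XML string.

    The element names are illustrative only and are NOT schema-valid MPEG-7.
    """
    root = ET.Element("Annotations", {"media": media_file})
    for a in annotations:
        node = ET.SubElement(root, "Annotation", {
            "mode": a.mode.value,
            "start": f"{a.start_sec:.2f}",
            "end": f"{a.end_sec:.2f}",
        })
        for label in a.labels:
            ET.SubElement(node, "Label").text = label
        if a.keywords:
            ET.SubElement(node, "Keywords").text = a.keywords
    return ET.tostring(root, encoding="unicode")
```

Recording the mode with each annotation preserves the modality under which the observation was made, so that a later consumer can distinguish annotations made with both audio and video from those made in isolation.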

[0008] In summary, one aspect of the invention provides an apparatus for managing multimedia content, the apparatus comprising: an arrangement for supplying multimedia content; an input interface for permitting the selection, for observation, of at least one of the following modes associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio; and an arrangement for annotating observations of a selected mode.

[0009] A further aspect of the invention provides a method of managing multimedia content, the method comprising the steps of: supplying multimedia content; permitting the selection, for observation, of at least one of the following modes associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio; and annotating observations of a selected mode.

[0010] Furthermore, an additional aspect of the invention provides a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for managing multimedia content, the method comprising the steps of: supplying multimedia content; permitting the selection, for observation, of at least one of the following modes associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio; and annotating observations of a selected mode.

[0011] For a better understanding of the present invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the invention will be pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 is a block diagram depicting a multi-modal annotation system.

[0013] FIG. 2 is an illustration of a system annotating video scenes, objects, and events.

[0014] FIG. 3 is an illustration of a system annotating audio with video.

[0015] FIG. 4 is an illustration of a system annotating audio without video.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0016] FIG. 1 is a block diagram of one preferred embodiment of a multi-modal annotation system in accordance with the present invention. The multimedia content and previous annotations are stored on the storage medium 100. When a user 130 selects a multimedia file from the storage medium 100 via the annotation tool 120, the file is loaded into working memory 110 and portions of it are displayed in the annotation tool 120. At any time, the user 130 may also request that previously saved annotations associated with the current multi-modal file be loaded from the storage medium 100 into working memory 110. The user 130 views the multimedia data by making requests through the annotation tool 120. The user 130 then annotates his observations and the annotation tool 120 saves these annotations in working memory 110. The user 130 can at any time request the annotation tool 120 to save the annotations on the storage medium 100.
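The flow of data among the storage medium 100, the working memory 110, and the annotation tool 120 might be sketched as follows. This is an illustrative sketch only: the class AnnotationSession and its method names are hypothetical, and a plain dictionary stands in for the storage medium; the specification does not prescribe any particular data structure.

```python
class AnnotationSession:
    """Hypothetical stand-in for the annotation tool 120."""

    def __init__(self, storage: dict):
        # Storage medium 100: a dict with "media" and "annotations" entries
        # is assumed here purely for illustration.
        self.storage = storage
        # Working memory 110: holds the current file and its annotations.
        self.working_memory = {}

    def load_file(self, name: str) -> None:
        """Load a user-selected multimedia file into working memory."""
        self.working_memory = {
            "file": name,
            "content": self.storage["media"][name],
            "annotations": [],
        }

    def load_saved_annotations(self) -> None:
        """Load previously saved annotations for the current file, if any."""
        name = self.working_memory["file"]
        saved = self.storage["annotations"].get(name, [])
        self.working_memory["annotations"] = list(saved)

    def annotate(self, note: str) -> None:
        """Record an annotation in working memory only."""
        self.working_memory["annotations"].append(note)

    def save(self) -> None:
        """Write the annotations from working memory to the storage medium."""
        name = self.working_memory["file"]
        self.storage["annotations"][name] = list(
            self.working_memory["annotations"]
        )
```

In this sketch, annotations reach the storage medium only when save is called, mirroring the description in which the tool keeps annotations in working memory until the user requests that they be saved.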

[0017] FIG. 2 is an illustration of a system annotating video scenes, objects, and events. (Simultaneous reference should also be made to FIG. 1.) The multimedia data has been loaded from the storage medium 100 into working memory 110. A video tab 290 has been selected. The multimedia video has been segmented into shots using scene change detection. A shot list window 200 displays a portion of the shots in the multimedia. Here, the user 130 has selected a shot 210, which is highlighted in the shot list window 200. A key frame 220, which is a representative frame among the frames of a shot, is preferably displayed. In addition, the frames of the shot may be viewed in the video window 230 using play controls 240. The video can be viewed with or without audio depending upon the selection of a mute button 250. The user 130 may select annotations for this shot by clicking the check boxes in the events 260, static scenes 270, or key objects 280 lists. Any significant observations which are not contained in the check boxes can be noted in a keywords text box 300.
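A per-shot annotation record of the kind described above might be sketched as follows. The names Shot, toggle, and observation_mode are hypothetical. The choice of the middle frame as the key frame is an assumption made only for this sketch; the specification states only that the key frame is representative of the shot and does not say how it is chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    """One shot produced by scene change detection (frame range inclusive)."""
    first_frame: int
    last_frame: int
    events: set = field(default_factory=set)         # check boxes 260
    static_scenes: set = field(default_factory=set)  # check boxes 270
    key_objects: set = field(default_factory=set)    # check boxes 280
    keywords: str = ""                               # keywords text box 300

    def key_frame(self) -> int:
        # ASSUMPTION: the middle frame is used as the representative frame.
        return (self.first_frame + self.last_frame) // 2


def toggle(selected: set, label: str) -> None:
    """Mimic clicking a check box: select the label, or clear it if set."""
    if label in selected:
        selected.remove(label)
    else:
        selected.add(label)


def observation_mode(muted: bool) -> str:
    """Map the state of the mute button 250 to the observation mode."""
    return "video without audio" if muted else "video with audio"
```

For example, a user viewing a shot with the mute button released would be observing in the video-with-audio mode, and clicking an entry in the events list twice would select and then clear it.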

[0018] FIG. 3 is an illustration of the system annotating audio with video. (Simultaneous reference should also be made to FIG. 1.) The multimedia data has been loaded from the storage medium 100 into working memory 110. The audio-with-video tab 370 has been selected. The multimedia video has been segmented into shots using scene change detection. The shot list window 200 displays a portion of the shots in the multimedia. The shot 210 associated with the current audio position is highlighted in the shot list window 200. The audio data is displayed in the window 390. A segment of audio 340 has been delimited for annotation; that is, the limits or bounds of the audio have been fixed for subsequent annotation. The video associated with the audio is shown in the video window 230. As the user 130 uses the play controls 360, the audio data display 390 is updated to display the current audio data and the video window 230 changes to reflect the current video frame. Thus, the user 130 may observe the video and simultaneously hear the audio while making audio annotations. The user 130 preferably uses the buttons 350 to delimit audio segments. Check boxes corresponding to the foreground sounds 320 (the most prominent sounds in the segment) and background sounds 330 (sounds which are present but are secondary to other sounds) may be checked to indicate sounds heard within the audio segment 340. Any significant observations which are not contained in the check boxes can be noted in the keywords text box 300.
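Highlighting the shot associated with the current audio position amounts to finding which shot contains a given playback time. One possible approach, offered only as a sketch, is a binary search over the sorted shot start times. The function name shot_index_at and the representation of shots as a list of start times in seconds are assumptions of this sketch, not details given in the specification.

```python
from bisect import bisect_right


def shot_index_at(shot_start_times: list, position_sec: float) -> int:
    """Return the index of the shot containing position_sec.

    shot_start_times must be sorted ascending; each shot is assumed to run
    from its start time up to the start of the next shot.
    """
    if not shot_start_times or position_sec < shot_start_times[0]:
        raise ValueError("position precedes the first shot")
    # bisect_right counts the shots starting at or before the position;
    # the last of those is the shot currently playing.
    return bisect_right(shot_start_times, position_sec) - 1
```

As the play controls advance the audio position, the returned index could be used both to highlight the corresponding entry in the shot list window and to select the frame shown in the video window.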

[0019] FIG. 4 is an illustration of the system annotating audio without video. (Simultaneous reference should also be made to FIG. 1.) The multimedia data has been loaded from the storage medium 100 into working memory 110. The audio-without-video tab 400 has been selected. The audio data is displayed in the window 390. A segment of audio 340 has been delimited for annotation. As the user 130 uses the play controls 360, the audio data display 390 is updated to display the current audio data. Thus, the user 130 hears only the audio while making audio annotations. The user 130 uses the buttons 350 to delimit audio segments. The check boxes for foreground sounds 320 and background sounds 330 may be checked to indicate sounds heard within the audio segment 340. Any significant observations which are not contained in the check boxes can be noted in the keywords text box 300.
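The delimiting of an audio segment and the recording of its foreground and background sounds might be sketched as follows. The names AudioSegment and SegmentDelimiter, and the two-step mark_start and mark_end interaction, are hypothetical; the specification states only that buttons 350 are used to fix the bounds of a segment.

```python
from dataclasses import dataclass, field


@dataclass
class AudioSegment:
    """A delimited span of audio awaiting annotation (times in seconds)."""
    start_sec: float
    end_sec: float
    foreground: set = field(default_factory=set)  # check boxes 320
    background: set = field(default_factory=set)  # check boxes 330
    keywords: str = ""                            # keywords text box 300


class SegmentDelimiter:
    """Hypothetical model of the buttons 350 that fix a segment's bounds."""

    def __init__(self):
        self._start = None

    def mark_start(self, position_sec: float) -> None:
        """Fix the start of the segment at the current audio position."""
        self._start = position_sec

    def mark_end(self, position_sec: float) -> AudioSegment:
        """Fix the end of the segment and return it for annotation."""
        if self._start is None:
            raise ValueError("mark_start must be called first")
        if position_sec <= self._start:
            raise ValueError("segment end must follow its start")
        segment = AudioSegment(self._start, position_sec)
        self._start = None  # ready for the next segment
        return segment
```

The same record could serve both the audio-with-video and audio-without-video modes, since the two differ in what the user observes rather than in what is recorded for the segment.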

[0020] It is to be understood that the present invention, in accordance with at least one presently preferred embodiment, includes an arrangement for supplying multimedia content, an input interface for permitting the selection, for observation, of a mode associated with the multimedia content, and an arrangement for annotating observations of a selected mode. Together, these elements may be implemented on at least one general-purpose computer running suitable software programs. These may also be implemented on at least one Integrated Circuit or part of at least one Integrated Circuit. Thus, it is to be understood that the invention may be implemented in hardware, software, or a combination of both.

[0021] If not otherwise stated herein, it is to be assumed that all patents, patent applications, patent publications and other publications (including web-based publications) mentioned and cited herein are hereby fully incorporated by reference herein as if set forth in their entirety herein.

[0022] Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.

What is claimed is:
 1. An apparatus for managing multimedia content, said apparatus comprising: an arrangement for supplying multimedia content; an input interface for permitting the selection, for observation, of at least one of the following modes associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio; and an arrangement for annotating observations of a selected mode.
 2. The apparatus according to claim 1, wherein said input interface permits the selection, for observation, of both of the following associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio.
 3. The apparatus according to claim 1, wherein said input interface additionally permits the selection, for observation, of solely a video portion of multimedia content.
 4. The apparatus according to claim 1, wherein said input interface additionally permits the selection, for observation, of solely an audio portion of multimedia content.
 5. The apparatus according to claim 1, wherein said arrangement for supplying multimedia content comprises a working memory which stores multimedia files.
 6. The apparatus according to claim 1, wherein said input interface is adapted to: first permit the selection of a multimedia file and then permit the selection of said at least one of: an audio portion simultaneously with video; and a video portion simultaneously with audio.
 7. The apparatus according to claim 1, further comprising a working memory for saving the annotated observations of a selected mode.
 8. The apparatus according to claim 1, wherein said input interface is adapted to permit the selection, for observation, of at least the following mode associated with the multimedia content: a video portion that includes audio.
 9. The apparatus according to claim 8, wherein said input interface comprises: an arrangement for permitting the selection, for observation, of a video mode of multimedia content; and an arrangement for selectably adding audio to the video mode for observation.
 10. A method of managing multimedia content, said method comprising the steps of: supplying multimedia content; permitting the selection, for observation, of at least one of the following modes associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio; and annotating observations of a selected mode.
 11. The method according to claim 10, wherein said step of permitting selection comprises permitting the selection, for observation, of both of the following associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio.
 12. The method according to claim 10, wherein said step of permitting selection additionally comprises permitting the selection, for observation, of solely a video portion of multimedia content.
 13. The method according to claim 10, wherein said step of permitting selection comprises permitting the selection, for observation, of solely an audio portion of multimedia content.
 14. The method according to claim 10, wherein said step of supplying multimedia content comprises providing a working memory which stores multimedia files.
 15. The method according to claim 10, wherein said step of permitting selection comprises: first permitting the selection of a multimedia file and then permitting the selection of said at least one of: an audio portion simultaneously with video; and a video portion simultaneously with audio.
 16. The method according to claim 10, further comprising the step of providing a working memory for saving the annotated observations of a selected mode.
 17. The method according to claim 10, wherein said step of permitting selection comprises permitting the selection, for observation, of at least the following mode associated with the multimedia content: a video portion that includes audio.
 18. The method according to claim 17, wherein said step of permitting selection comprises: permitting the selection, for observation, of a video mode of multimedia content; and thereafter enabling the addition of audio to the video mode for observation.
 19. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for managing multimedia content, said method comprising the steps of: supplying multimedia content; permitting the selection, for observation, of at least one of the following modes associated with the multimedia content: an audio portion that includes video; and a video portion that includes audio; and annotating observations of a selected mode.